
quickwit: add tag_fields on CounterID, drop positions on raw text #877

Closed

alexey-milovidov wants to merge 48 commits into add-quickwit-entry from quickwit-tag-fields-record-basic


Conversation

@alexey-milovidov
Member

Summary

Two index-level changes to quickwit/index_config.yaml, keeping the rest of the benchmark setup identical.

  • tag_fields: [CounterID] — Q37-Q43 all filter CounterID = 62. Tagging it writes the per-split CounterID values into the metastore so the searcher can prune whole splits before opening them. This is the closest analogue we get to Elasticsearch's index.sort early-termination on the same column. Quickwit/Tantivy has no real multi-column doc-sort to match the full ES sort.field: [CounterID, EventDate, UserID, EventTime, WatchID], so this picks up just the CounterID dimension.
  • record: basic on every tokenizer: raw text field (28 fields). Tantivy defaults text postings to WithFreqsAndPositions, but raw-tokenized fields only ever hold one term per document — phrase queries can't run against them, so freqs and positions are dead weight in the index.
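For reference, here is a minimal sketch of how the two settings slot into quickwit/index_config.yaml, using Quickwit's documented doc_mapping syntax. The field names and types below are illustrative placeholders rather than the benchmark's full mapping:

```yaml
doc_mapping:
  field_mappings:
    - name: CounterID
      type: i64               # placeholder type, not necessarily the real mapping's
    - name: Title             # stands in for each of the 28 tokenizer: raw fields
      type: text
      tokenizer: raw
      record: basic           # single-term postings keep neither freqs nor positions
  tag_fields: [CounterID]     # per-split values go to the metastore for split pruning
```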

Validated against the running v0.9.0-nightly server (the same image benchmark.sh uses): the tag_fields and record: basic settings round-trip cleanly through the index-create API.

Test plan

  • Re-run bash benchmark.sh end-to-end on a fresh machine
  • Compare cold + warm timings against the previous results, especially Q37–Q43 (CounterID filter) for the tag_fields benefit
  • Confirm load time and on-disk size — both should stay flat or shrink slightly thanks to record: basic

🤖 Generated with Claude Code

alexey-milovidov and others added 30 commits May 8, 2026 20:11
Some historical clickhouse-cloud entries stored cluster_size as a
JSON string ("1", "2", "3") while modern ones use plain integers
(1, 2, 3). The dashboard treats the two representations as distinct
values and renders cluster_size 2 and 3 twice in selectors. Convert
all string-numeric cluster_size values to integers across the repo.
Non-numeric strings (serverless, dedicated) are left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
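A sketch of that rule as a hypothetical TypeScript helper (the repo edit itself was a one-off conversion of the JSON files, not this function):

```typescript
// String-numeric cluster_size values become integers; everything else,
// including "serverless" and "dedicated", passes through untouched.
function normalizeClusterSize(value: unknown): unknown {
  if (typeof value === "string" && /^\d+$/.test(value)) {
    return parseInt(value, 10); // "2" -> 2, so selectors stop showing duplicates
  }
  return value;
}
```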
generate-results.sh used to take, for every (system, basename) pair,
the latest dated copy across all date subdirectories. That meant the
website kept surfacing a benchmark machine even after the system was
re-run on a newer date whose results no longer cover that machine.

Switch the rule: for each <system>/results/, find the lexicographically
greatest YYYYMMDD subdirectory and emit every file it contains. Older
subdirs remain in the repo as history but are not rendered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Use only the latest date subdir of each system for the dashboard
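The new rule, re-expressed as a hypothetical TypeScript helper (the real logic lives in generate-results.sh):

```typescript
// dirs: the YYYYMMDD subdirectory names under <system>/results/.
// Returns the directory whose files should be rendered, or undefined
// if the system has no dated subdirectories.
function latestResultsDir(dirs: string[]): string | undefined {
  return dirs
    .filter((d) => /^\d{8}$/.test(d)) // keep only YYYYMMDD names
    .sort()                           // lexicographic order == chronological here
    .pop();                           // greatest date wins; every file in it is emitted
}
```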
Revert #874: restore previous generate-results.sh behavior
When all selected systems return null for a query, the per-query baseline
becomes Math.min() over an empty set (Infinity), which makes
log(curr/Infinity) = -Infinity and collapses every system's geometric
mean to 0: bars render with width 0 and the chart appears empty.

Reproduction: filter to Elasticsearch + Quickwit (Q28's REGEXP_REPLACE
fails on both).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skip queries that fail on every filtered system
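A sketch of the failure mode and the guard, with hypothetical names (the dashboard's actual code is organized differently):

```typescript
// times[s][q] is system s's time for query q, or null if the query failed
// on that system. Math.min() over zero arguments is Infinity, so an
// all-null query column used to contribute log(curr / Infinity) = -Infinity
// and drag the whole geometric mean to 0.
function relativeGeomean(times: Array<Array<number | null>>, s: number): number {
  let logSum = 0;
  let counted = 0;
  for (let q = 0; q < times[s].length; q++) {
    const col = times.map((row) => row[q]).filter((t): t is number => t !== null);
    if (col.length === 0) continue;    // the fix: skip queries that fail everywhere
    const curr = times[s][q];
    if (curr === null) continue;       // this system failed; penalized elsewhere
    const baseline = Math.min(...col); // was Infinity before the guard above
    logSum += Math.log(curr / baseline);
    counted++;
  }
  return Math.exp(logSum / counted);
}
```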
The c8g.metal-48xl run was committed in 2b124ba ("Update
clickhouse-datalake-partitioned results", authored 2026-02-18) with
the date field accidentally set to "2027-02-18". The restructure in
bb91b0c then put it under results/20270218/ — making it the
lexicographically-latest dir despite containing a single machine,
which masked the real latest dir (20260506).

Move the file to results/20260218/ alongside the other 2026-02-18
results from the same commit, and correct the date field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Identify obsolete results by comparing each system's older-dated
result files against the canonical pre-refactor flat layout
(`bb91b0cf5~1`, the commit just before the date-subdir restructure).
Any older-dated file whose basename was not in that flat layout
represents a machine/configuration that had already been deleted from
the canonical state — mark those `"historical"` so the dashboard
doesn't surface them.

This catches:
- Old ClickHouse Cloud configurations (Dedicated, colder-cache and
  parallel-replicas experiments, retired size tiers 40 / 56 / 80 /
  128 / 240 GiB).
- Old ClickHouse hardware (c5.4xlarge, m5d.24xlarge, m6i.32xlarge,
  *.zstd, *.tuned, *.tuned.memory, c5n.4xlarge for clickhouse-web).
- Per-system retired runs (DataFusion `f16s_v2` and old
  `single.json`, MotherDuck `result.json`/`result_*`/`pulse`/
  `standard`, polars and polars-dataframe retired filenames,
  starrocks `*.untuned`, paradedb 1500GB, hydra `c6a.4xlarge` (now
  on `hydra.json`), etc).

Also drops tags from older results that aren't in the union of tags
in the system's latest dated subdir, except `"historical"` (catches
residuals like `"analytical"`, `"MySQL compatible"` in old databend,
`"Python"` in arc, `"open-source"`/`"dataframe"`/`"parquet"` in
polars, etc — applying the rules from earlier tag-removal commits
d661b49 / 46a535b / fb09092 / ae85f0d / 0aab48e to the historical
copies that still carried the deprecated tags).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the manual arch-detection + zip download with the official
installer at install.gizmosql.com, mirroring the pattern DuckDB uses
in this repo. The installer handles arch/OS detection and installs
to ~/.local/bin by default, which we then prepend to PATH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…3a.small

- datafusion/results/<YYYYMMDD>/single.json renamed to c6a.4xlarge.json
  to match the per-machine naming used everywhere else; the historical
  tag is removed from those files since they no longer represent an
  obsolete basename.
- datafusion/results/20250522/single.json deleted as redundant —
  c6a.4xlarge.json already exists in the same dir with identical
  metadata and identical numeric results (the only diffs are
  trailing-zero formatting).
- duckdb-vortex/results/20250521/c6a.4xlarge-single.json deleted for
  the same reason — same date / system / machine / metadata as the
  canonical c6a.4xlarge.json next to it.
- firebolt-parquet{,-partitioned}/results/20260221/t3a.small.json
  removed entirely; those entries were incorrect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both systems had genuine standalone runs on AWS hardware that were
incorrectly tagged "historical" by the pre-refactor flat-layout
heuristic — the flat layout only kept the most recent canonical
machine per system, so older one-off machines looked obsolete even
though the run is still meaningful as a historical comparison point.

- glaredb/results/20240202/c6a.metal.json — drop historical
- hydra/results/{20221209,20230919}/c6a.4xlarge.json — drop historical

Also delete glaredb/results/20250525/c6a.4xlarge-parquet-single.json
as redundant (same date / system / machine / metadata as the canonical
c6a.4xlarge.json next to it; numerical results identical, only
trailing-zero formatting differs — same situation as the
datafusion/duckdb-vortex *-single duplicates removed in the previous
commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- motherduck/results/{20240127,20241029}/result.json renamed to
  result_standard.json. The runs were originally machine="cloud"
  (back when Motherduck only offered one tier); update machine to
  "Motherduck: standard" to match current naming and drop the
  historical tag.
- paradedb/results/20240202/c6a.4xlarge.1500gb.json deleted —
  identical results to c6a.4xlarge.json next to it; the .1500gb
  filename was a one-off clarification for an Elasticsearch comparison
  per its comment field. The canonical c6a.4xlarge.json carries the
  same numbers without that side-comment.
- paradedb/results/20240713/single.json deleted — same date / system /
  machine / load_time / data_size as the canonical c6a.4xlarge.json
  next to it; results differ only by tiny numerical noise (<= 0.001s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- polars/results/20241129/DataFrame_c6a.metal.json moved to
  polars-dataframe/results/20241129/c6a.metal.json (the run is
  system="Polars (DataFrame)", so it belongs in polars-dataframe).
- polars/results/{20241129,20241215}/parquet_c6a.metal.json /
  parquet_c6a.4xlarge.json renamed to drop the parquet_ prefix
  (parquet is the default encoding for polars/, so the prefix is
  redundant — polars-dataframe/ is the dataframe variant).
- Historical tag dropped from all three renamed files.

The pre-existing canonical c6a.metal.json / c6a.4xlarge.json in those
date dirs were re-runs that ended up there because their date field
wasn't updated when the data was refreshed in commit 69d3e50;
the renamed files carry the actual 2024-11-29 / 2024-12-15 numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- starrocks/results/{20220715,20220925}/*.untuned.json — old untuned
  variant from when both tuned and untuned runs were captured
  separately. The canonical c6a.4xlarge.json / c6a.metal.json next
  to them already record an untuned run (tuned="no") with the
  modern schema.
- timescaledb/results/20220701/c6a.4xlarge.compression.json — old
  compression-on variant; the canonical c6a.4xlarge.json carries
  the standard TimescaleDB run for that date.
- trino{,-partitioned}/results/202605{06,07}/c8g*.json — c8g runs
  removed entirely (per maintainer instruction).
- umbra/results/20251026/c6a.{2xlarge,xlarge}.json — incorrect
  results, removed entirely.
- arc/results/2025*/m3_max*.json — m3_max runs removed entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…large,metal-48xl}

- starrocks/results/{20220715,20220925}/{c6a.4xlarge,c6a.metal}.json
  replaced with the content of the previously-deleted *.untuned.json
  files. The untuned numbers are the right canonical record for those
  dates (the prior "tuned" canonical was a parallel run that wasn't
  the one used to establish the historical entry). Drops the
  "historical" tag and the "ClickHouse derivative" tag (not in latest
  starrocks tag set), keeps system="StarRocks".
- trino{,-partitioned}/results/20260507/c8g.4xlarge.json and
  c8g.metal-48xl.json restored. Per maintainer note, only
  c8g.24xlarge.json was supposed to be removed; the other two c8g
  variants stay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mark stale results historical to clean up the dashboard
motherduck/ uses lowercase tier names ("Motherduck: jumbo",
"Motherduck: mega", "Motherduck: standard"); pg_duckdb-motherduck/
had three files with "Motherduck: Jumbo" (capital J). Lower-case the
J so the dashboard groups all jumbo-tier runs under one machine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rt sizes

For cloud-service results whose .machine value contains a memory size
(GB / GiB) or a T-shirt size (XS / S / M / L / XL / nXL, etc.), drop
the redundant cloud-name prefix so the dashboard groups runs by the
actual size rather than the (system, machine) tuple. The system field
on each entry already carries the cloud name; repeating it inside
.machine just bloats the X axis.

Also normalize T-shirt sizing variants to their letter form:
  Small → S, Medium → M, Large → L,
  X-Small → XS, X-Large → XL,
  2X-Small → 2XS, 2X-Large → 2XL, 3X-Large → 3XL, 4X-Large → 4XL,
  5X-Large → 5XL.

Affected systems: AlloyDB, ByteHouse, CHYT, ClickHouse Cloud
(every aws/azure/gcp tier), CrunchyBridge, Databricks, Hydra,
Snowflake, Supabase, Tablespace, Timescale Cloud, pgpro_tam.

Bare-metal hardware descriptions (CPU model + RAM, "AWS c5.metal
100GB", etc) are left unchanged — the rule applies to managed-cloud
machine labels only.

Aurora's "16acu", Hologres' "16 CU", Redshift's "ra3.4xlarge", and
SingleStore's "S2"/"S24" don't match the GB or T-shirt-size pattern
and are also left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
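Both rules, sketched as a hypothetical TypeScript helper (the actual changes were applied directly to the result JSON files; `cloud` stands for the cloud name repeated in the label and is assumed to contain no regex metacharacters):

```typescript
// Longest variants first, so "2X-Large" is rewritten before "X-Large"
// or "Large" can match inside it.
const TSHIRT: Array<[RegExp, string]> = [
  [/\b2X-Small\b/, "2XS"], [/\b2X-Large\b/, "2XL"], [/\b3X-Large\b/, "3XL"],
  [/\b4X-Large\b/, "4XL"], [/\b5X-Large\b/, "5XL"],
  [/\bX-Small\b/, "XS"], [/\bX-Large\b/, "XL"],
  [/\bSmall\b/, "S"], [/\bMedium\b/, "M"], [/\bLarge\b/, "L"],
];

function normalizeMachine(machine: string, cloud: string): string {
  let m = machine;
  for (const [pattern, letter] of TSHIRT) m = m.replace(pattern, letter);
  // Drop the cloud-name prefix only when the remainder is a memory size or a
  // T-shirt size; labels like "16acu", "16 CU", "ra3.4xlarge", and "S2" fail
  // both tests and stay whole.
  const rest = m.replace(new RegExp(`^${cloud}[\\s:,-]+`, "i"), "");
  if (rest !== m && (/\d+\s?Gi?B/.test(rest) || /^(\d?X?[SL]|M)\b/.test(rest))) {
    m = rest;
  }
  return m;
}
```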
Convert "<digits><space?>GB" → "<digits>GiB" in cloud-service machine
names. Where the value also carries an "<N> vCPU " prefix in front
of the GB amount, drop that prefix — the GiB tier already conveys
the size, so "8 vCPU 64 GB" simplifies to "64GiB".

Examples:
- "8 vCPU 64 GB" (AlloyDB) → "64GiB"
- "10 vCPU 40GB" (CHYT) → "40GiB"
- "720GB" (CHYT) → "720GiB"
- "Analytics-256GB" (Crunchy Bridge) → "Analytics-256GiB"
- "L1 - 16CPU 32GB" (Tablespace) → "L1 - 16CPU 32GiB"
  (16CPU is not "vCPU" so it stays)
- "8 vCPU 32GB" (Timescale ☁️) → "32GiB"
- "16 vCPU 32GB" / "30 vCPU 480GB" (pgpro_tam) → "32GiB" / "480GiB"
- "64 vCPU 256GB" (YDB) → "256GiB"

Bare-metal hardware descriptions in hardware/, versions/, gravitons/
(e.g. "AWS c5.metal 100GB", "Linode 16GB", "Steam Deck 512 GB",
"AMD EPYC 3.2 GHz, Micron 5100 MAX 960 GB") are left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
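The conversion, sketched as a hypothetical TypeScript helper following the rules above:

```typescript
function normalizeGiB(machine: string): string {
  // Drop an "<N> vCPU " prefix only when it directly precedes the GB amount;
  // the GiB tier already conveys the size ("8 vCPU 64 GB" -> "64GiB").
  const m = machine.replace(/^\d+\s*vCPU\s+(?=\d+\s?GB\b)/, "");
  // "<digits><optional space>GB" -> "<digits>GiB". "16CPU" is not "vCPU",
  // so "L1 - 16CPU 32GB" keeps its prefix and becomes "L1 - 16CPU 32GiB".
  return m.replace(/(\d+)\s?GB\b/, "$1GiB");
}
```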
Normalize machine names: drop redundant cloud prefix, normalize T-shirt sizes
github and others added 16 commits May 8, 2026 20:05
The c7i.metal-48xl runs in chdb / chdb-dataframe /
chdb-parquet-partitioned were one-off captures that aren't part of
the canonical machine set (the canonical chdb suite uses c6a / c7a /
c8g variants). Tag them "historical" so they stop appearing on the
dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
"Analytics-256GiB" → "256GiB". The system field already says
"Crunchy Bridge (Parquet)", so the "Analytics-" prefix is redundant
once the cloud-name has been dropped from the machine label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep only the RAM size as the machine label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GizmoSQL: use the official one-line installer
@CLAassistant

CLAassistant commented May 9, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
3 out of 4 committers have signed the CLA.

✅ alexey-milovidov
✅ rschu1ze
✅ prmoore77
❌ github


github seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

alexey-milovidov and others added 2 commits May 9, 2026 11:39
tag_fields: [CounterID] writes per-split CounterID values into the
metastore so the searcher can prune whole splits before opening them
for queries 37-43, which all filter CounterID = 62 — the closest
analogue to Elasticsearch's index.sort early-termination here.

record: basic on every tokenizer: raw text field skips storing freqs
and positions in the postings; phrase queries can never run against
single-term raw fields, so the data was dead weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov force-pushed the quickwit-tag-fields-record-basic branch from 76a4092 to 5bb3d7e on May 9, 2026 at 11:39